AITopics | Gulf of Mexico

Collaborating Authors

Gulf of Mexico

Attention Is Not All You Need: The Importance of Feedforward Networks in Transformer Models

arXiv.org Artificial IntelligenceMay-13-2025

Decoder-only transformer networks have become incredibly popular for language modeling tasks. State-of-the-art models can have over a hundred transformer blocks, containing billions of trainable parameters, and are trained on trillions of tokens of text. Each transformer block typically consists of a multi-head attention (MHA) mechanism and a two-layer fully connected feedforward network (FFN). In this paper, we examine the importance of the FFN during the model pre-training process through a series of experiments, confirming that the FFN is important to model performance. Furthermore, we show that models using a transformer block configuration with three-layer FFNs with fewer such blocks outperform the standard two-layer configuration delivering lower training loss with fewer total parameters in less time.

large language model, linear layer, machine learning, (22 more...)

arXiv.org Artificial Intelligence

2505.06633

Country:

North America > Mexico > Gulf of Mexico (0.04)
North America > Dominican Republic (0.04)
Europe > Italy > Sardinia (0.04)

Genre: Research Report > New Finding (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation

Fang, Haoquan, Grotz, Markus, Pumacay, Wilbert, Wang, Yi Ru, Fox, Dieter, Krishna, Ranjay, Duan, Jiafei

arXiv.org Artificial IntelligenceFeb-11-2025

Robotic manipulation systems operating in diverse, dynamic environments must exhibit three critical abilities: multitask interaction, generalization to unseen scenarios, and spatial memory. While significant progress has been made in robotic manipulation, existing approaches often fall short in generalization to complex environmental variations and addressing memory-dependent tasks. To bridge this gap, we introduce SAM2Act, a multi-view robotic transformer-based policy that leverages multi-resolution upsampling with visual representations from large-scale foundation model. SAM2Act achieves a state-of-the-art average success rate of 86.8% across 18 tasks in the RLBench benchmark, and demonstrates robust generalization on The Colosseum benchmark, with only a 4.3% performance gap under diverse environmental perturbations. Building on this foundation, we propose SAM2Act+, a memory-based architecture inspired by SAM2, which incorporates a memory bank, an encoder, and an attention mechanism to enhance spatial memory. To address the need for evaluating memory-dependent tasks, we introduce MemoryBench, a novel benchmark designed to assess spatial memory and action recall in robotic manipulation. SAM2Act+ achieves competitive performance on MemoryBench, significantly outperforming existing approaches and pushing the boundaries of memory-enabled robotic systems. Project page: https://sam2act.github.io/

machine learning, reinforcement learning, sam2act, (17 more...)

arXiv.org Artificial Intelligence

2501.18564

Country:

North America > Mexico > Gulf of Mexico (0.14)
South America > Peru (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.46)
(2 more...)

Add feedback

Sparse Autoencoders Reveal Temporal Difference Learning in Large Language Models

Demircan, Can, Saanum, Tankred, Jagadish, Akshay K., Binz, Marcel, Schulz, Eric

arXiv.org Artificial IntelligenceOct-2-2024

In-context learning, the ability to adapt based on a few examples in the input prompt, is a ubiquitous feature of large language models (LLMs). However, as LLMs' in-context learning abilities continue to improve, understanding this phenomenon mechanistically becomes increasingly important. In particular, it is not well-understood how LLMs learn to solve specific classes of problems, such as reinforcement learning (RL) problems, in-context. Through three different tasks, we first show that Llama $3$ $70$B can solve simple RL problems in-context. We then analyze the residual stream of Llama using Sparse Autoencoders (SAEs) and find representations that closely match temporal difference (TD) errors. Notably, these representations emerge despite the model only being trained to predict the next token. We verify that these representations are indeed causally involved in the computation of TD errors and $Q$-values by performing carefully designed interventions on them. Taken together, our work establishes a methodology for studying and manipulating in-context learning with SAEs, paving the way for a more mechanistic understanding.

large language model, machine learning, reinforcement learning, (16 more...)

arXiv.org Artificial Intelligence

2410.0128

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
South America > Peru > Loreto Department (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
(4 more...)

Genre: Research Report > New Finding (0.67)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

L4GM: Large 4D Gaussian Reconstruction Model

Ren, Jiawei, Xie, Kevin, Mirzaei, Ashkan, Liang, Hanxue, Zeng, Xiaohui, Kreis, Karsten, Liu, Ziwei, Torralba, Antonio, Fidler, Sanja, Kim, Seung Wook, Ling, Huan

arXiv.org Artificial IntelligenceJun-14-2024

We present L4GM, the first 4D Large Reconstruction Model that produces animated objects from a single-view video input - in a single feed-forward pass that takes only a second. Key to our success is a novel dataset of multiview videos containing curated, rendered animated objects from Objaverse. This dataset depicts 44K diverse objects with 110K animations rendered in 48 viewpoints, resulting in 12M videos with a total of 300M frames. We keep our L4GM simple for scalability and build directly on top of LGM [49], a pretrained 3D Large Reconstruction Model that outputs 3D Gaussian ellipsoids from multiview image input. L4GM outputs a per-frame 3D Gaussian Splatting representation from video frames sampled at a low fps and then upsamples the representation to a higher fps to achieve temporal smoothness. We add temporal self-attention layers to the base LGM to help it learn consistency across time, and utilize a per-timestep multiview rendering loss to train the model. The representation is upsampled to a higher framerate by training an interpolation model which produces intermediate 3D Gaussian representations. We showcase that L4GM that is only trained on synthetic data generalizes extremely well on in-the-wild videos, producing high quality animated 3D assets.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2406.10324

Country:

North America > Canada > Ontario > Toronto (0.14)
Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
North America > United States > Oklahoma > Beaver County (0.04)
(5 more...)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)

Add feedback

Diffusion Tuning: Transferring Diffusion Models via Chain of Forgetting

Zhong, Jincheng, Guo, Xingzhuo, Dong, Jiaxiang, Long, Mingsheng

arXiv.org Artificial IntelligenceJun-6-2024

Diffusion models have significantly advanced the field of generative modeling. However, training a diffusion model is computationally expensive, creating a pressing need to adapt off-the-shelf diffusion models for downstream generation tasks. Current fine-tuning methods focus on parameter-efficient transfer learning but overlook the fundamental transfer characteristics of diffusion models. In this paper, we investigate the transferability of diffusion models and observe a monotonous chain of forgetting trend of transferability along the reverse process. Based on this observation and novel theoretical insights, we present Diff-Tuning, a frustratingly simple transfer approach that leverages the chain of forgetting tendency. Diff-Tuning encourages the fine-tuned model to retain the pre-trained knowledge at the end of the denoising chain close to the generated data while discarding the other noise side. We conduct comprehensive experiments to evaluate Diff-Tuning, including the transfer of pre-trained Diffusion Transformer models to eight downstream generations and the adaptation of Stable Diffusion to five control conditions with ControlNet. Diff-Tuning achieves a 26% improvement over standard fine-tuning and enhances the convergence speed of ControlNet by 24%. Notably, parameter-efficient transfer learning techniques for diffusion models can also benefit from Diff-Tuning.

artificial intelligence, diffusion model, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2406.00773

Country:

North America > Mexico > Gulf of Mexico (0.04)
Europe > United Kingdom (0.04)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Brain Stroke Segmentation Using Deep Learning Models: A Comparative Study

Soliman, Ahmed, Yousif, Yousif, Ibrahim, Ahmed, Zafari-Ghadim, Yalda, Rashed, Essam A., Mabrok, Mohamed

arXiv.org Artificial IntelligenceMar-25-2024

Stroke segmentation plays a crucial role in the diagnosis and treatment of stroke patients by providing spatial information about affected brain regions and the extent of damage. Segmenting stroke lesions accurately is a challenging task, given that conventional manual techniques are time consuming and prone to errors. Recently, advanced deep models have been introduced for general medical image segmentation, demonstrating promising results that surpass many state of the art networks when evaluated on specific datasets. With the advent of the vision Transformers, several models have been introduced based on them, while others have aimed to design better modules based on traditional convolutional layers to extract long-range dependencies like Transformers. The question of whether such high-level designs are necessary for all segmentation cases to achieve the best results remains unanswered. In this study, we selected four types of deep models that were recently proposed and evaluated their performance for stroke segmentation: a pure Transformer-based architecture (DAE-Former), two advanced CNN-based models (LKA and DLKA) with attention mechanisms in their design, an advanced hybrid model that incorporates CNNs with Transformers (FCT), and the well-known self-adaptive nnUNet framework with its configuration based on given data. We examined their performance on two publicly available datasets, and found that the nnUNet achieved the best results with the simplest design among all. Revealing the robustness issue of Transformers to such variabilities serves as a potential reason for their weaker performance. Furthermore, nnUNet's success underscores the significant impact of preprocessing and postprocessing techniques in enhancing segmentation results, surpassing the focus solely on architectural designs

artificial intelligence, machine learning, segmentation, (18 more...)

arXiv.org Artificial Intelligence

2403.17177

Country:

Europe > France > Grand Est > Bas-Rhin > Strasbourg (0.04)
Asia > Japan (0.04)
North America > Mexico > Gulf of Mexico (0.04)
(4 more...)

Genre: Research Report > New Finding (0.34)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

TCJA-SNN: Temporal-Channel Joint Attention for Spiking Neural Networks

Zhu, Rui-Jie, Zhao, Qihang, Zhang, Tianjing, Deng, Haoyu, Duan, Yule, Zhang, Malu, Deng, Liang-Jian

arXiv.org Artificial IntelligenceDec-9-2023

Spiking Neural Networks (SNNs) are attracting widespread interest due to their biological plausibility, energy efficiency, and powerful spatio-temporal information representation ability. Given the critical role of attention mechanisms in enhancing neural network performance, the integration of SNNs and attention mechanisms exhibits potential to deliver energy-efficient and high-performance computing paradigms. We present a novel Temporal-Channel Joint Attention mechanism for SNNs, referred to as TCJA-SNN. The proposed TCJA-SNN framework can effectively assess the significance of spike sequence from both spatial and temporal dimensions. More specifically, our essential technical contribution lies on: 1) We employ the squeeze operation to compress the spike stream into an average matrix. Then, we leverage two local attention mechanisms based on efficient 1D convolutions to facilitate comprehensive feature extraction at the temporal and channel levels independently. 2) We introduce the Cross Convolutional Fusion (CCF) layer as a novel approach to model the inter-dependencies between the temporal and channel scopes. This layer breaks the independence of these two dimensions and enables the interaction between features. Experimental results demonstrate that the proposed TCJA-SNN outperforms SOTA by up to 15.7% accuracy on standard static and neuromorphic datasets, including Fashion-MNIST, CIFAR10-DVS, N-Caltech 101, and DVS128 Gesture. Furthermore, we apply the TCJA-SNN framework to image generation tasks by leveraging a variation autoencoder. To the best of our knowledge, this study is the first instance where the SNN-attention mechanism has been employed for image classification and generation tasks. Notably, our approach has achieved SOTA performance in both domains, establishing a significant advancement in the field. Codes are available at https://github.com/ridgerchu/TCJA.

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2206.10177

Country:

Asia > China (0.04)
North America > United States > California > Santa Cruz County > Santa Cruz (0.04)
North America > Mexico > Gulf of Mexico (0.04)
(2 more...)

Genre:

Research Report > Promising Solution (0.48)
Overview > Innovation (0.48)
Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

SageFormer: Series-Aware Framework for Long-term Multivariate Time Series Forecasting

Zhang, Zhenwei, Meng, Linghang, Gu, Yuantao

arXiv.org Artificial IntelligenceOct-26-2023

In the burgeoning ecosystem of Internet of Things, multivariate time series (MTS) data has become ubiquitous, highlighting the fundamental role of time series forecasting across numerous applications. The crucial challenge of long-term MTS forecasting requires adept models capable of capturing both intra- and inter-series dependencies. Recent advancements in deep learning, notably Transformers, have shown promise. However, many prevailing methods either marginalize inter-series dependencies or overlook them entirely. To bridge this gap, this paper introduces a novel series-aware framework, explicitly designed to emphasize the significance of such dependencies. At the heart of this framework lies our specific implementation: the SageFormer. As a Series-aware Graph-enhanced Transformer model, SageFormer proficiently discerns and models the intricate relationships between series using graph structures. Beyond capturing diverse temporal patterns, it also curtails redundant information across series. Notably, the series-aware framework seamlessly integrates with existing Transformer-based models, enriching their ability to comprehend inter-series relationships. Extensive experiments on real-world and synthetic datasets validate the superior performance of SageFormer against contemporary state-of-the-art approaches.

data mining, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2307.01616

Country:

North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
Pacific Ocean > North Pacific Ocean > San Francisco Bay (0.04)
Oceania > Australia (0.04)
(7 more...)

Genre:

Research Report > Promising Solution (0.48)
Overview > Innovation (0.34)

Industry:

Health & Medicine (1.00)
Information Technology > Smart Houses & Appliances (0.48)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)

Add feedback

A Lightweight, Efficient and Explainable-by-Design Convolutional Neural Network for Internet Traffic Classification

Fauvel, Kevin, Chen, Fuxing, Rossi, Dario

arXiv.org Artificial IntelligenceJun-5-2023

Traffic classification, i.e. the identification of the type of applications flowing in a network, is a strategic task for numerous activities (e.g., intrusion detection, routing). This task faces some critical challenges that current deep learning approaches do not address. The design of current approaches do not take into consideration the fact that networking hardware (e.g., routers) often runs with limited computational resources. Further, they do not meet the need for faithful explainability highlighted by regulatory bodies. Finally, these traffic classifiers are evaluated on small datasets which fail to reflect the diversity of applications in real-world settings. Therefore, this paper introduces a new Lightweight, Efficient and eXplainable-by-design convolutional neural network (LEXNet) for Internet traffic classification, which relies on a new residual block (for lightweight and efficiency purposes) and prototype layer (for explainability). Based on a commercial-grade dataset, our evaluation shows that LEXNet succeeds to maintain the same accuracy as the best performing state-of-the-art neural network, while providing the additional features previously mentioned. Moreover, we illustrate the explainability feature of our approach, which stems from the communication of detected application prototypes to the end-user, and we highlight the faithfulness of LEXNet explanations through a comparison with post hoc methods.

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2202.05535

Country:

North America > Mexico > Gulf of Mexico (0.14)
Europe > France (0.04)
Asia > China (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Telecommunications (0.69)
Information Technology > Security & Privacy (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

LiDAR2Map: In Defense of LiDAR-Based Semantic Map Construction Using Online Camera Distillation

Wang, Song, Li, Wentong, Liu, Wenyu, Liu, Xiaolu, Zhu, Jianke

arXiv.org Artificial IntelligenceJun-4-2023

Semantic map construction under bird's-eye view (BEV) plays an essential role in autonomous driving. In contrast to camera image, LiDAR provides the accurate 3D observations to project the captured 3D features onto BEV space inherently. However, the vanilla LiDAR-based BEV feature often contains many indefinite noises, where the spatial features have little texture and semantic cues. In this paper, we propose an effective LiDAR-based method to build semantic map. Specifically, we introduce a BEV feature pyramid decoder that learns the robust multi-scale BEV features for semantic map construction, which greatly boosts the accuracy of the LiDAR-based method. To mitigate the defects caused by lacking semantic cues in LiDAR data, we present an online Camera-to-LiDAR distillation scheme to facilitate the semantic learning from image to point cloud. Our distillation scheme consists of feature-level and logit-level distillation to absorb the semantic information from camera in BEV. The experimental results on challenging nuScenes dataset demonstrate the efficacy of our proposed LiDAR2Map on semantic map construction, which significantly outperforms the previous LiDAR-based methods over 27.9% mIoU and even performs better than the state-of-the-art camera-based approaches. Source code is available at: https://github.com/songw-zju/LiDAR2Map.

artificial intelligence, lidar2map, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2304.11379

Country:

North America > Mexico > Gulf of Mexico (0.04)
Asia > China > Guangxi Province > Nanning (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Transportation > Ground > Road (0.34)
Information Technology > Robotics & Automation (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.67)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.67)

Add feedback